A Framework to Deal with Missing Data in Data Sets

Authors

  • Luai Al Shalabi
  • Mohannad Najjar
  • Ahmad Al Kayed
Abstract

Most information systems contain some missing values because data are unavailable. Missing values reduce the quality of the classification rules generated by a data mining system. They also affect the number of classification rules the system produces, and they can influence the coverage percentage and the number of reducts generated. As a result, missing values make it difficult to extract useful information from a data set. Solving the missing-data problem is therefore a high priority in data mining and knowledge discovery. Replacing a missing value with a specific value should not degrade the quality of the data. Four different models for dealing with missing data were studied. A framework is established that removes inconsistencies both before and after the attributes with missing values are filled with the expected value generated by one of the four models. Comparative results are discussed and recommendations are drawn.
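
The pipeline outlined in the abstract (remove inconsistencies, fill missing values, remove inconsistencies again) might look roughly like the sketch below. The abstract does not name the four models, so the most-frequent-value fill used here is only an assumed stand-in. The definition of an inconsistency (rows that agree on every condition attribute but disagree on the decision attribute), the choice to drop all such rows, and every function and attribute name are likewise illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

MISSING = None  # assumed marker for a missing value


def condition_key(row, decision):
    """Values of all condition attributes, in a fixed attribute order."""
    return tuple(row[a] for a in sorted(row) if a != decision)


def remove_inconsistencies(rows, decision):
    """Drop rows that share condition values but differ on the decision.

    Assumption: every row of a conflicting group is dropped.
    """
    decisions = defaultdict(set)
    for row in rows:
        decisions[condition_key(row, decision)].add(row[decision])
    return [r for r in rows if len(decisions[condition_key(r, decision)]) == 1]


def fill_with_mode(rows, decision):
    """Hypothetical fill model: replace a missing value with the
    attribute's most frequent observed value."""
    filled = [dict(row) for row in rows]
    if not filled:
        return filled
    for attr in (a for a in filled[0] if a != decision):
        observed = [r[attr] for r in rows if r[attr] is not MISSING]
        if not observed:
            continue  # nothing to estimate from; leave the gaps as they are
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[attr] is MISSING:
                r[attr] = mode
    return filled


def clean(rows, decision, fill=fill_with_mode):
    """Remove inconsistencies, fill missing values, then check again,
    because filling can itself create new conflicts."""
    rows = remove_inconsistencies(rows, decision)
    rows = fill(rows, decision)
    return remove_inconsistencies(rows, decision)


# Toy data set, invented purely for illustration.
data = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "no"},   # conflicts with the row above
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": MISSING, "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
]
result = clean(data, decision="play")
```

In this toy data the two `sunny` rows conflict before filling, and filling the missing `windy` value with `yes` creates a new conflict among the `rain` rows, which is why the second consistency pass matters. Any other fill model can be supplied through the `fill` parameter.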

Similar resources

Probabilistic Linkage of Persian Record with Missing Data

Extended Abstract. When comprehensive information about a topic is scattered across two or more data sets, using only one of them loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplicates within a data set. The i...

Flow Shop Scheduling Problem with Missing Operations: Genetic Algorithm and Tabu Search

Flow shop scheduling problem with missing operations is studied in this paper. Missing operations assumption refers to the fact that at least one job does not visit one machine in the production process. A mixed-binary integer programming model has been presented for this problem to minimize the makespan. The genetic algorithm (GA) and tabu search (TS) are used to deal with the optimization...

Investigating the missing data effect on credit scoring rule based models: The case of an Iranian bank

Credit risk management is a process in which banks estimate the probability of default (PD) for each loan applicant. Data sets of previous loan applicants are built by gathering their data; these internal data sets are usually completed using external credit bureau data and finally used for estimating PD in banks. There is also continuing interest among banks in using rule-based classifiers to b...

DEA with Missing Data: An Interval Data Assignment Approach

In classical data envelopment analysis (DEA) models, inputs and outputs are assumed to be known, and these models cannot directly handle unknown values. In recent years, there has been little research on handling missing data. This paper suggests a new interval-based approach to handling missing data, which is a modified version of the Kuosmanen (2009) approach. First, the prop...

Image Classification via Sparse Representation and Subspace Alignment

Image representation is a crucial problem in image processing, where many low-level image representations exist, e.g., SIFT, HOG and so on. But there is a missing link between low-level and high-level semantic representations. In fact, traditional machine learning approaches, e.g., non-negative matrix factorization, sparse representation and principal component analysis are employed to d...

Comparison of Bayesian and Classical Methods for Estimating Logistic Regression Model Parameters in the Presence of Missing Values in Covariates

Background and Aim: Logistic regression is an analytic tool widely used in medical and epidemiologic research. In many studies, we face data sets in which some of the data are not recorded. A simple way to deal with such "missing data" is to simply ignore the subjects with missing observations, and perform the analysis on cases for which complete data are available. Materials and Methods: We c...

Publication date: 2006